The Internet Stopped

Seven days ago, many services on the web went down. Chances are, you were using one of those services at the time. You probably started wondering if your internet was going wonky. Maybe you rebooted your device. Maybe you even rebooted your modem and router. But after the device came back up, still no go. What was going on?

According to Amazon, it was their bad.

More specifically, it was Amazon’s own infrastructure software that broke. Hopefully, you will stay with me as I explain this in layman’s terms. See, Amazon builds its large-scale distributed systems out of smaller independent services. These services interact with each other using APIs, which allows AWS to operate them independently. So, they split the system into services that are responsible for executing customer requests (the data plane) and services that are responsible for managing and vending customer configuration (the control plane). Amazon Elastic Compute Cloud (EC2) is an example of an architecture that includes a data plane and a control plane. The data plane consists of the physical servers where customers’ Amazon EC2 instances run.

The control plane consists of a number of services that interact with the data plane, performing functions like telling each server which EC2 instances need to run, keeping running EC2 instances up to date with Amazon Virtual Private Cloud configuration, receiving metering data, logs, and metrics emitted by the servers, and deploying new software to the servers. Two things to note about these planes: the data plane and the control plane need to stay in sync with each other, and the size of the data plane fleet exceeds the size of the control plane fleet, frequently by a factor of 100 or more.
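To make the split concrete, here is a toy sketch in Python (not AWS's actual implementation; all class and method names are invented) of a small control plane vending configuration to a much larger data plane fleet and keeping it in sync:

```python
class DataPlaneServer:
    """A data plane server: runs customer workloads and simply
    applies whatever configuration the control plane hands it."""
    def __init__(self, server_id):
        self.server_id = server_id
        self.config_version = 0

    def apply_config(self, version):
        self.config_version = version


class ControlPlane:
    """Manages and vends configuration to the data plane fleet,
    which is typically ~100x larger than the control plane itself."""
    def __init__(self, fleet):
        self.fleet = fleet
        self.latest_version = 0

    def push_config(self):
        """Bump the config version and propagate it to every server,
        so the two planes stay in sync."""
        self.latest_version += 1
        for server in self.fleet:
            server.apply_config(self.latest_version)


fleet = [DataPlaneServer(i) for i in range(100)]
control = ControlPlane(fleet)
control.push_config()
```

The point of the separation is that customer workloads (the data plane) keep running even if the configuration machinery (the control plane) is degraded; they only drift out of sync until the control plane recovers.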

Ok, so now that you know what control and data planes are, keep reading for Amazon’s explanation of what happened.

To explain this event, we need to share a little about the internals of the AWS network. While the majority of AWS services and all customer applications run within the main AWS network, AWS makes use of an internal network to host foundational services including monitoring, internal DNS, authorization services, and parts of the EC2 control plane. Because of the importance of these services in this internal network, we connect this network with multiple geographically isolated networking devices and scale the capacity of this network significantly to ensure high availability of this network connection. These networking devices provide additional routing and network address translation that allow AWS services to communicate between the internal network and the main AWS network. At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network. This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks. These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks.
— Amazon spokesperson, AWS
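The amplification AWS describes — errors triggering immediate retries, which add more load, which causes more errors — is a classic retry storm. A common mitigation (not something AWS said it lacked here; this is general practice) is exponential backoff with jitter, so a fleet of clients doesn't hammer a struggling service in lockstep. A minimal sketch, with all names invented:

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=30.0):
    """Full-jitter exponential backoff: pick a random delay between
    0 and min(cap, base * 2**attempt) before the next retry."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(operation, max_attempts=5):
    """Run a flaky operation, sleeping a jittered, exponentially
    growing delay between attempts instead of retrying immediately."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            time.sleep(backoff_delay(attempt))
```

The randomness matters as much as the exponent: without jitter, every client that failed at the same moment retries at the same moment, recreating the surge.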

Services Affected Last Week

So, to be clear, Amazon’s US-EAST-1 region was the one affected. This caused plenty of issues, especially with Amazon’s own delivery services. Amazon drivers had no idea where their next stop was because the app they used was connected to the infrastructure that was down.

If you were playing PUBG, you would have been kicked and unable to reconnect for over 6 hours.

A lot of home automation was affected. People took to Downdetector.com to complain that Ring doorbells were not functioning correctly, and Amazon’s Echo devices would spit out ‘sorry, I can’t help you right now, try again later’.

Roombas went berserk and rose up against their owners.

Even at Walt Disney World, guests had trouble making reservations for different services and checking how long the lines were for different attractions.

Surprisingly, even Google was affected. (Remember, AWS was around before Google got into the cloud storage business and teamed up with Amazon’s S3 product.)

Venmo was affected: if you wanted to check your balance or send and receive money, the app would open and then fail.

People got pretty hangry when their DoorDash food never arrived.

Even the sleazy crypto trading app Robinhood was affected. (Hopefully this put the final nail in their coffin. Time will tell.)

Frequently visited government websites, such as My Social Security—the portal for online accounts at the U.S. Social Security Administration—also reported disruptions.


Key Takeaways
